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Abstract — A key objective of the smart grid is to improve 
reliability of utility services to end users. This requires strength- 
ening resilience of distribution networks that lie at the edge 
of the grid. However, distribution networks are exposed to 
external disturbances such as hurricanes and snow storms 
where electricity service to customers is disrupted repeatedly. 
External disturbances cause large-scale power failures that are 
neither well-understood, nor formulated rigorously, nor studied 
systematically. This work studies resilience of power distribution 
networks to large-scale disturbances in three aspects. First, a non- 
stationary random process is derived to characterize an entire 
life cycle of large-scale failure and recovery. Second, resilience 
is defined based on the non-stationary random process. Close 
form analytical expressions are derived under specific large- 
scale failure scenarios. Third, the non-stationary model and the 
resilience metric are applied to a real life example of large- 
scale disruptions due to Hurricane Ike. Real data on large-scale 
failures from an operational network is used to learn time-varying 
model parameters and resilience metrics. 

Index Terms — Non-stationarity, resilience, non-homogeneous 
Poisson process, mixture model, real data 



I. Introduction 

Our power grid is a vast interconnected network that delivers 
electricity from suppliers to consumers. At the edge of the 
grid lies power distribution networks |jT]. Power distribution 
networks provide medium or low voltages to residence and 
organizations, and thus serve a unique role of connecting the 
grid to end users. 

Distribution networks consist of "leaf nodes" of the energy 
infrastructure and are thus susceptible to external disturbances. 
For example, natural disasters repeatedly cause devastating 
destructions and service disruptions to distributions networks 
Q |[3|. There were 6 major hurricanes and more than 10 
snow and ice storms occurred in north America in the past 
5 years Q. Each natural disaster disrupted power services 
to more than 500,000 customers for days |4|. Large-scale 
power outages are more prevalent and damaging in developing 
countries, whose energy demands are rapidly increasing but 
power infrastructures are still in development |5|. 

A fundamental research problem pertaining to the real 
problem is the resilience of power distribution to large-scale 
external disruptions. Resilience corresponds to the ability of 
distribution networks to withstand external disturbances and to 
recover rapidly from failures. Up to date, tremendous attention 



has been directed to resilience of the core that consists of 
major power generation and transmission systems of high 
voltages |6| |7| |8|. As an external disturbance often affects 
power distribution networks in a wide geo-graphical span, the 
service disruptions usually remain local at the edge |j9|. The 
resilience of the edge, however, is understudied fTO) . As the 
demand on energy is growing, the edge of the grid is becoming 
thicker. For example, a utility provider often serves millions 
of customers in America. Damages from external disturbances 
on power distribution can thus result in a profound impact to a 
large number of users. Hence, the issue of resilience of power 
distribution networks is much needed to be investigated. 

Empirical studies have been conducted on how to access 
damages from large-scale power outages (see flT] and ref- 
erence therein). Monitoring systems have been developed 
and used in power industry to respond to failures due to 
natural disasters (see |12J as an example). These empirical 
methods predict the degree of damages, e.g., the maximum 
number of outages upon a natural disaster |2|, often through 
observations and experiences. Quantifiable approaches are 
lacking and needed to leverage practical experiences to general 
principles on resilience of power distribution Q p3) pO) . In 
other words, resilience, as a concept about an overall power 
distribution network, needs to be learned systematically from 
real data. 

Two research challenges emerge. One is what to learn 
from real data. Real data consists of samples on responses 
of a distribution network to external disturbances. External 
disturbances such as hurricanes often occur suddenly and 
unpredictably. The resulting large-scale failure and recovery 
at distribution networks thus exhibit random and dynamic 
behavior. For example, recovery depends on multiple random 
factors such as environmental conditions, available resources 
and preparation. Prior work models network failures and 
recoveries using finite state Markov processes p4[ . These 
models belong to the general birth-and-death process p3| 
which include randomness but assume stationary failure and 
recovery. Non-stationary failure such as a non-homogeneous 
Poisson point process is provided in the context of Mt/G/oo 
queues |16) p7| with time varying Poisson parameters. Gen- 
eral time-dependent infinite-server arrival/service processes are 



studied in 1 18 1 1 19 1. However, non-stationary recovery is rarely 
included in network resiUence. In fact, it is an open issue 
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how to characterize large-scale non- stationary failure and non- 
stationary recovery for power distribution networks. 

The second challenge is how to define and characterize 
resilience for an entire non-stationary life-cycle of failure 
and recovery. Prior work provides general discussions that 
resilience should include multiple attributes, such as both 
failure and recovery | [20| pT| p2) . However, most prior works 
mainly study resilience in terms of maintaining services, and 
thus consider failures only |23|. In fact, recovery is rarely 
included in defining resilience of power distribution networks. 
It is an open issue how to derive resilience metrics for 
characterizing non-stationarity in both failure and recovery. 

Hence it is necessary to formulate, from ground up, an 
entire life cycle of large-scale failure and recovery. This 
can prevent us from choosing a model and/or a learning 
approach subjectively. We first develop a problem formulation 
of large-scale failures at the finest level of network nodes 
based on temporal-spatial stochastic processes. However, as 
an individual external disturbance results in one "snapshot" 
of network responses, information (data) from one snapshot is 
insufficient to completely specify such temporal-spatial model 
| [24) . We thus derive temporal models of an entire distribution 
network by aggregating spatial variables. The resulting tempo- 
ral process models an entire non-stationary life-cycle of large- 
scale failures. The model applies to general failure process 
that can be dependent and with an arbitrary distribution. Two 
distinct recovery-characteristics emerge from our model. One 
is infant recovery that reflects the ability to recover rapidly 
from failures. The other is aging recovery that corresponds to 
prolonged failures. We define the resilience as the probability 
of infant recovery for a power distribution network. We then 
derive analytical expressions for special cases of failure and 
recovery. 

The model and the resilience metric are studied in a real 
life example of large scale service disruptions of power 
distribution. Power failures occurred during a major natural 
disaster. Hurricane Ike in 2008. Pertinent resilience parameters 
are learned using real data from an operational distribution 
network. 

The rest of the paper is organized as follows. Section [ll| 
provides background knowledge and an example of large- 
scale failures at power distribution networks. Section [III]devel- 
ops a problem formulation of failure-and-recovery processes. 



Section IV characterizes non-stationary failure and recovery 



individually and jointly; then derives analytical expressions 
for special cases of failure and recovery. Section [V] defines 
network resilience based on the non-stationary model. Section 
VI learns pertinent resilience parameters using large-scale real 



data. Section VII discusses our findings and concludes the 
paper. 

II. Background and Example 

A power distribution network is at the edge of the grid from 
substations to users. A power distribution network consists 
of components including substations, feeders, transformers, 
poles, and transmission lines, and meters. 

A large number of such devices in a distribution network are 
often in the open, and thus susceptible to natural disasters such 
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Fig. 1. Empirical distribution of failure durations in 3D. 



as hurricanes, ice and snow storms. For example, a fallen pole 
can cause a short circuit, and other devices to fail subsequently. 
An external disturbance such as a hurricane can cause a large 
number of failures. The failures interrupt electricity service 
to end users. A large-scale external disruption can affect one 
or many distribution networks in a wide geo-graphical area. 
Failure recovery is often done by dispatching crews to the field. 
Promptness of failure recovery thus depends on environmental 
constraints, preparedness, and resources. 

To gain intuition on an entire life cycle of failure and 
recovery, we consider an example of large-scale power failures 
occurred during Hurricane Ike. Hurricane Ike had a landfall 
in 2008 and affected densely populated areas in Texas and 
Louisiana. Figure [T] shows a histogram on failure occurrence 
time and duration at an operational distribution network be- 
fore, during and after the hurricane. The example provides the 
following observations: 

(a) Both failure occurrence and recovery are time-varying, 
i.e., non-stationary. 

(b) Samples on failure occurrence time and duration are not 
identically distributed. Instead, recovery time characterized by 
failure duration is different for failures occurred at different 
time. 

This shows that failure occurrence and recovery are statis- 
tically dependent, and non-stationary. 

III. Stochastic Model 

We formulate large-scale failure and recovery based on 
non-stationary random processes. We begin with the detailed 
information on nodal statuses in a distribution network. We 
then aggregate the spatial variables of nodes to obtain temporal 
evolution of failure and recovery of an entire network. 

A. Failure and Recovery Probability 

A temporal-spatial random process provides a theoretical 
basis for modeling large-scale failures at the finest scale of 
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nodes. The temporal variable is time t. The spatial variable is 
the location of a node in the grid. Here, nodes can be any com- 
ponents in a distribution network such as substations, feeders, 
hubs, transformers, transmission lines, and power circuits. A 
shorthand notation i is used to specify both the index of node 
i and its corresponding location, where i £ S — {1,2, ...,n} 
for a power distribution network with n nodes. 

Let Xi{t) be the status of the i-th node at time t > for 
1 < i < n. Xi{t) = 1 if the i-th node is in a failure mode. 
Xi (t) = if the node is in normal operation. For a distribution 
network, an example of a node failure includes a failed circuit, 
a fallen pole, a broken link, and an non-operational substation. 

Failures caused by external disturbances exhibit random- 
ness. Whether and when a node fails is random. Whether and 
when a failed node recovers is also random. Hence, random 
processes can be used to characterize failure and recovery for 
all nodes in a network. 

Given time t > 0, P{Xi(t + t) = 1} characterizes the 
probability that node i is failed in the near future t + r, where 
T > is a small time increment. Assume a node changes state, 
i.e., from failure to normal and vice versa. Then for the ith 
node, 1 < i < n, 

P{X,{t + T) = l}-P{X,{t) = 1} 
=P{X,{t + T) = l,X,{t) = 0} (1) 
-P{X,(t + T)=0,X,{t) = l}. 

Eq{T] provides a model for an individual node in a network. 
The model includes Markov temporal dependence and spatial 
dependence among nodal statuses. Such a model can be 
applied to a heterogeneous grid where nodes experience in 
general different failure and recovery processes. There are n 
such temporal-spatial equations for n nodes in a distribution 
network. The n equations together form a temporal-spatial 
model for a network. 

B. Temporal Process 

When large-scale outages are caused by individual external 
disturbances, information available is from "snapshots" of 
temporal spatial network statuses. A snapshot corresponds to 
spatial-temporal nodal statuses with respect to one external 
disturbance, e.g., one hurricane. As there are usually only a 
few such snapshots available, information is insufficient for 
specifying a complete temporal spatial model at the node level. 
However, spatial variables can be aggregated out from Eq{T] 
making the information sufficient for temporal characteristics, 
where 

■les ies 

= Y.Pmt + r) = l,X,{t)=0} 

■ies 

-Y,P{Mt + r)^0,X,{t) = 1}. 
ies 

Furthermore, the probability can be related to an indicator 
function, 

P{X,{t) = 1} = E{I[X,{t) = 1]}. (3) 



Then we can define a temporal process as follows. 

Definition: A temporal processes {N{t) e IN, ^ > 0} is a 
special case of the temporal-spatial process where the spatial 
variables (i's) are aggregated for all nodes in a network. N{t) 
is the number of nodes in failure state at time t, 

N{t)=Y,I[X,it) = l], (4) 

ies 

where I{A) is an indicator function, i.e., I{A) = 1 if event A 
occurs, and I{A) — otherwise. 

Combining Equations |2j [3] and |4] we have, 

E{AN{t)} = J2 P{Mt + ^) = 1} - E = (5) 

ies ies 

where AN{t) = N{t + r) - N{t). Hence, an expected 
increment of the number of failed nodes in a network equals 
to the total change of the aggregated probabilities on nodal 
statuses. 

An increment AN{t) in the total number of nodes in failure 
state can result from either newly failed or newly recovered 
nodes. To further characterize the temporal process N{t), we 
define a failure process and a recovery process respectively. 

Definition: Failure and recovery processes: Failure process 
{Nf{t) e IN; t > 0} is the number of failures occurred up to 
time t. Recovery process {Nr{t) e lN,t > 0} is the number 
of recoveries occurred up to time t. 

Assume t > is sufficiently small so that at most one 
failure occurs during {t,t + r), then an increment on the 
number of failures satisfies 

E{ANf{t)} = ^P{X,it + T) = l,X,{t)^0}, (6) 

■ies 

where ANf{t) = Nf{t + r)} - Nf{t). Similarly, for a 
sufficiently small r, it can be assumed that at most one 
recovery occurs during {t,t + r). Then, 

£;{AiV,(i)} = ^P{X,(i + T) = G,X,{t) = 1}, (7) 
ies 

where ANrit) = Nr{t+T)}-Nr{t). Hence, Eq|2]is simplified 
as, 

E{AN{t)} ^ E{ANf{t)} ~ E{ANr(t)}. (8) 

Furthermore, we assume at time to — 0, N{t) — 0, Nf{t) — 0, 
and Nr{t) = 0. Aggregating increments in Eqjsjfrom to t, 
we have, 

E{N{t)}^E{Nf{t)}~E{Nrit)}. (9) 

Hence, the expected number of nodes in failure state 
equals to the difference between the expected failures and 
the expected recoveries. This characterizes how the status of 
a distribution network changes from the present to the near 
future. 
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For practicality, the empirical processes as the sample means 
N{t), Nf{t), and Nr(t) can be used to estimate the true ex- 
pectations E{N{t)}, E{Nf{t)}, and E{Nr{t)}, respectively. 
Eq|9] can then be represented by the empirical processes. 



N{t)^Nfit)-Nrit). 



(10) 



The empirical processes and the failure-and-recovery equa- 
tion allow learning from field data, which shall be elaborated 
in Section |VI] 

IV. Non-Stationary Failure and Recovery 

We now focus on the temporal processes to derive non- 
stationary characteristics on failure and recovery individually 
and jointly. We derive close form expressions for special 
cases of failure and recovery processes. Our study reveals 
pertinent parameters which shall be used to quantify resilience 
in Section IV] 

A. Failure Process 

Let \f{t) be the intensity function of the failure process. 
A / (t) is the expected number of new failures per unit time at 
epoch t, i.e.. 



A/(t) 



lim 



E{Nj{t + r)-Ni{t)) 



(11) 



\f{t) is also referred to as the rate function of the failure 
process Nf{t). The larger Xf{t) is, the more failures occur in 
a unit time duration. Hence, failure rate quantifies the intensity 
of failure occurrence. An non-stationary failure process has 
a time-varying intensity function A/(<). Assuming a failure 
process begins at t = 0, we have 



(12) 



E{Nfit)} = / Xf{v)dv 



The probability density function f{t) of failure occurrence 
time at t can be obtained from A/(t), where 

E{Nf{t + e)-Nf{t)}l 
e 



fit) = lim 



lim„ 



(13) 



/o A/(s)ds 



As a special case of a general failure process, Xf{t), 
as the first moment, can completely determines the failure 
process Nf{t). Consider an independent failure process where 
the number of failures Nf{t) is a counting process with 
independent increment ANf{t). Among many such random 
processes, non-homogeneous Poisson process (NHPP) (15| 
captures the non-stationary nature of failures in a parametric 
form. The parameter is the intensity function, Xf{t). The non- 
stationarity refers to a common characteristic of large-scale 
external disruptions where power failures occur at different 
intensity at different time. The definition of a NHPP, applied 
to a failure process, is provided below. 



Deflnition p3| : A continuous time counting process 
{Nf{t),t > 0} is a Non-Homogeneous Poisson Process with 



a time-varying rate function Xf{t),t > 0, if Nf (t) satisfies 
the following conditions: 
. 7V/(0)=0; 

• {Nf{t),t > 0} has independent increments; 

. P{Nf{t + t) - Nf{t) > 2} = o(t); 

. P{Nf{t + r) - Nf{t) = 1} = Xfit)r + o(t). 

In Section [Vl] we shall show using real data that a failure 
process {Nf{t)} from a hurricane follows a NHPP with Xf(t) 
as the failure rate function. 

B. Recovery Process 

We now define intensity function Xr{t) for a recovery 
process. Xr{t) is the expected number of new recoveries per 
unit time at epoch t, 

A.W=lim^{^Mi±l)}^^Wt)}. (14) 

T-fO T 

Xr{t) is also referred to as the rate function of the recovery 
process Nr{t). 

An non-stationary recovery process has a time-varying 
intensity function, i.e., Xr{t). Assuming the temporal failure 
process begins at to — 0, we have 

E{Nr{t)} = / Xr{v)dv. (15) 



The recovery rate characterizes how rapidly recovery oc- 
curs, which is measured by failure duration D. For an non- 
stationary recovery process, a failure duration depends on 
when a failure occurs as illustrated in Figure [T] Such non- 
stationarity of recovery is characterized by g{d\t) which is 
a conditional probability density function of failure duration 
D given failure time T. For a given threshold do > 0, the 
conditional probability that a duration is bounded by do for 
failures occurred around time t is 



P{D < do\t} 



g{v\t)dv, 



(16) 



where _D is a random failure duration. 

When do is sufficiently small, this probability characterizes 
rapid recovery that occurs shortly after failures. For a given 
do, the larger P{D < do\t} is, the more dominating the rapid 
recovery is. Given desired value of probability P{D < do\t}, 
the smaller do is, the more dominating the rapid recovery is. 

Rapid recovery is referred to as infant recovery. This ter- 
minology is borrowed from infant mortality in survivability 
analysis f25]. Infant recovery is a desirable characteristic of 
the smart grid. In contrast, slow recovery is referred to as 
aging recovery in analogous to aging mortality |26J. Infant 
and aging recovery can be formally defined as follows. 

Deflnition: Infant and aging recovery: Let do > be a 
threshold value. If a node remains in failure for a duration 
less than do; a recovery is an infant recovery. Otherwise, the 
recovery is aging recovery. Infant recovery is characterized by 
P{D < do\t}. Aging recovery is characterized by P{D > 
do 10- 
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Note here P{D < da\t} is a function of failure occurrence 
time. As we shall show through a real-life example in Section 
VI failures occurred at different time may experience infant 
and/or aging recovery of different degrees, showing the non- 
stationarity of a recovery process. 

C. Joint Failure-and-Recovery Process 

A joint failure and recovery process characterizes an entire 
life cycle of a failure-and-recovery process (FRP), and repre- 
sents the total number of nodes N{t) in failure state at time t 
(Eq|4]). The expected number of nodes in failure can be written 
in terms of rate functions, 



E{N(t)}^ / [\j{v)-K{v)]dv. 



Failure-and-recovery process can be viewed as an alternative 
form of the birth-and-death process | |15J . However, commonly- 
used birth-and-death processes have a stationary distribution of 
failure duration and assume independence between failure oc- 
currence t and failure duration d. Here, these two assumptions 
are not needed. This implies that failures occurred at different 
time can last different duration. For example, under strong 
and sustained hurricane wind, failures that do not happen in 
day-to-day operation can occur due to, e.g., falling debris and 
power lines. We shall further elaborate this through the real- 



life example in Section VI 

A failure process and a recovery process are related by 
failure durations. As a special case when failure follows a non- 
homogeneous Poisson process, the following theorem shows 
that the recovery process is also a non-homogeneous Poisson 
process parameterized by recovery rate Xr{t). In addition Xr{t) 
can be expressed by failure rate \f{t) with the help of the 
distribution of failure duration g{d\t). 

Theorem Assume 

(a) Independent failure occurrence {T} obeys a non- 
homogeneous Poisson process {Nf[t)} with intensity function 

(b) Failure duration {D} follows a conditional probability 
density function g{d\t) for d > 0, t > 0. 

Then the recovery time {T + D} is drawn from a non- 
homogeneous Poisson process {Nr{t)} with recovery intensity 
function. 



Kit) 



Xf{s)g(t — s\s)ds. 



(18) 



The proof of the theorem is given in Appendix |A] 

Hence, if a failure process is non-homogeneous Poisson, our 
general formulation of the temporal random process reduces 
to a Mt/Gt/oo queue. Mt indicates time-dependent Poisson 
failures. Gt indicates non-stationary distribution of failure 
duration, oo is for the infinite number of servers for repair; 
thus recovery can occur right after failure. Furthermore, when 
g{d\t) = g{d) is stationary, Xr{t) reduces to the convolution 
of the failure rate and service time distribution g{d). The 
Mt/Gt/oo model reduces to Mt/G /oo queue developed in 



D. Special Cases 

We consider two special cases where failure, recovery and 
joint processes exhibit simple analytical expressions. These 
expressions provide insights to an entire non-stationary life 
cycle of failure-and-recovery process under different failure 
scenarios. 

Case 1: Failure and recovery in day-to-day operation: 
When there are no significant external disruptions, power 
outages are assumed to occur randomly and sporadically. The 
failure intensity remains at a constant level, i.e., Xf{t) — Aq, 
where Aq > does not vary with time. The number of failures 
Nf{t) thus follows a homogeneous Poisson process in day- 
to-day operation. The recovery rate in day-to-day operation 



Ar.o(i) in Eq 18 reduces to. 



Xr,oit) ^ K g{t - s\s)ds. 



(19) 



As t ^ oo, Xrfi{t) — > Aq. This suggests that the recovery 
process in day-to-day operation is also a homogeneous Poisson 
process at a long time horizon. Then the expectation E{Nf){t)} 
of joint failure-and-recovery process in day-to-day operation 
is 



E{Noit)} 



[Xo - Xrfi{v)]dv. 



(20) 



Case 2: Surging failure during a natural disaster: 
Now consider a disaster scenario where failures occur 
suddenly and intensely. Xf{t) increases from a small value 
Aq to a large value in a short time duration < t < ti. The 
failure rate can then be written as. 



Xf{t) = X,n{t)[u{t) - u{t - ti)] + Ao, 



(21) 



where u{t) is the unit step function, maxj X„i{t) ^ Aq. This 
corresponds to an extreme case when a disaster causes sudden 
failures at time and then weakens right after 



A,.(t) in Equation 18 becomes, 

/'min(t,ii ) 



Xrit) 



Xrn{s)git - s\s)ds + Xrflit). (22) 







When t ^ ti, Xr{t) « Aq. Furthermore, when ti is 
sufficiently small, i.e., a surge of failures upon the disaster 
only lasts a short time. 



Xr{t) = X„,{0)g{t\0)mm{t,ti}u{t) + Xr,o{t). 



(23) 



When the peak of surging failures A„i(0) ^ Aq, the recovery 
rate in Eq|23] is approximately proportional to the surging 
failure rate and distribution of recovery time immediately after 
the disaster In the long run, Xr{t) reduces to Aq. 

Substituting Eq|2T|and[23]to Eq{T7] we have the expectation 
of the failure-and-recovery process in surging failures as, 

E{N{t)} = A„(0)[1 - Git\0)] min{t, t^} + E{No{t)}, (24) 

where G{t\s) — g{v\s)dv is the conditional cumulative 
density function (cdf) of failure duration given time t. 
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Additional characteristics of recovery can further reduce the 
above expressions. When infant recovery completely domi- 
nates a recovery process, g{d\Q) ^ when d > do. Hence, 
for an impulse-like surge of failures at t 0, the resulting 
recovery from the surge lasts do duration, i.e., 

Xr{t) = A„(0)s(t|0) min{t, ti}[u(t) ~ u{t - do)] + \rfi(t). (25) 

Eq 24 in dominating infant recovery becomes. 



E{N{t)} 
/ A^(0)[l~G(i|0)]min{t,ti}- 
l E{No{t)}, 



■E{No(t)}, 



0<t<do, (26) 
t > do- 



Here failure duration do is assumed to be larger than the 
end of failure process ti to simplify the expression. Eq 25 



and Eq 26 show that when a recovery process consists of only 



infant recovery, all failures due to disasters recover by do after 
the failure eruption. After do the distribution network resumes 
day-to-day operation. 

When aging recovery dominates the recovery process, 
g{d\0) « for d < do- This implies that recovery begins with 
delay do after a surge of failures. The corresponding recovery 
rate reduces from Equation |23] where 

Kit) = A,„(0)g(t|0) min{t, t^juit - do) + Xr,o(t). (27) 

Eq[24l in dominating aging recovery becomes, 

E{Nit)} = 

Xm{0)mm{t,ti} + E{No{t)}, 0<t<do, 
Mti[l-Git\0)]+E{No{t)}, t>do. 



(28) 



A. 



Here we also assume do > ti. Eq|28] shows another extreme 
case where recovery does not begin until do delay from the 
failure eruption. Then the failures start to recover slowly. 
Hence, aging recovery shows the difficulty in resuming service 
to customers, and is thus an undesirable characteristic for the 
smart grid. 

In the general failure-and-recovery process, recovery begins 
as soon as failures occur During the disaster, the failure pro- 
cess dominates and E{N{t)} increases rapidly. Afterwards, 
the recovery process dominates. At the large time scale and 
when disaster lasts for a short period of time, the overlap 
between the failure process and recovery process can be 
neglected. Thus the failure-and-recovery process splits into 
individual processes, failure and recovery, respectively. 

V. Resilience 

We now derive network resilience using the pertinent pa- 
rameters for an entire life cycle of non- stationary failure and 
recovery. 

A. Definition 

An intuition on how to define resilience results from the 
previous section. Resilience should be characterized by viru- 
lence of failures A/(i), speed of recovery g{d\t) and threshold 
do- These three parameters determine infant recovery. A distri- 
bution network experiences a combination of infant and aging 
recovery in general. A network is intuitively more resilient 



when exhibiting more infant recovery. Hence we define the 
resilience as the probability of infant recovery. 

Definition: Resilience: Given a threshold value do on failure 
duration, the resilience of a power distribution network is 
defined as 

s(do) = P{D < do}. (29) 

At a high-level, s{do) measures the resilience of the grid 
from large-scale disruptions, and reflects the ability for a distri- 
bution network as a whole to survive large-scale failures. Here 
we use P{D < do} rather than time dependent conditional 
probability P{D < do\t}. This is because resilience, when 
viewed as a network-wide quantity, should include an entire 
life-cycle of failure and recovery but not depend on specific 
time epoch t. 

B. Resilience Parameters 

Resilience can be expressed explicitly by the three pertinent 
parameters as follows. 



3(do) =Et{G{do\t)} 



P{D < do\t}f{t)dt 



(30) 



do 



g{v\t)f{t)dvdt, 

) 

where G{do\t) = P{D < do\t}, and f{t) is the probability 



density function of failure time given in Equation 13 Eq|30| 



results in an alternative expression for resilience. There, s{do) 
is the expected value of the probability of infant recovery 
G{do\t) averaged over the non-stationary failure process. This 
expression shows explicitly how resilience is determined by 
the three pertinent parameters. 

C. Threshold 

do is one other pertinent parameter that determines the 
resilience. For a given value of do, if infant recovery dom- 
inates the recovery process over the entire failure duration, 
s{do) — >■ 1. If aging recovery dominates the recovery process 
over the entire failure process, s{do) — > 0. Thus, a larger 
s{do\t) represents a more resilient power distribution network. 
When s{do) = 0.5, the infant and the aging recovery take 
equal weight. 

How to determine the value for do? do can result from prac- 
tical considerations or customer requirements. For example, 
when recovering within 24 hours is regarded as acceptable for 
a disaster scenario, do = 24 defines infant recovery. 

A value for do can also be determined more objectively. An 
intuition results from the fact that if a network is dominated 
by infant recovery, most failure durations are less than do. 
Hence, the slope of s{x) is relatively large for a; < do and 
relatively small for x > do. This implies that do corresponds 
to a pertinent change point of the slope of s{x), i.e., a deep 
valley in the second derivative -^s{x). This is illustrated in 
in Fig|2];a). In contrast, when a network is dominated by aging 
recovery, most of the failure durations are larger than do. Thus 
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Fig. 2. Illustration: quantitative method for determining cLq. 
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Fig. 3. Histogram of failure occurrence times during Hurricane Ike. 



the slope of s{x) is small for x < and large for x > do. do 
then corresponds to a positive peak in -^s{x) as illustrated 
in Fig|2];b). Hence, do is determined by the largest magnitude 
of the second derivative of the s{x), 



^ ^ I argmin^{^s(a;)}, s{x) is concave, 
I argmax^{-^s(a;)}, s{x) is convex. 



(31) 



We shall provide an example using real data in Section VI 



VI. Large-Scale Outages due to Hurricane Ike 

We now apply the above framework on non-stationary 
failure-recovery processes to a real-life example of large-scale 
utility-service disruptions caused by a hurricane. Our focus 
is on using real data to learn the three resilience parameters 
Xf{t), g{d\t) and do and then to estimate the resilience of an 
operational power distribution network. 



A. Real Data and Processing 

Hurricane Dee was one of the strongest hurricanes occurred 
in 2008. Ike caused large scale power failures, resulting in 
more than 2 million customers without electricity, and marked 
as the second costliest Atlantic hurricane of all time f2J\ ||28). 

Reported by National Hurricane Center p9| , the storm 
started to cause power outages across the onshore areas in 
Louisiana and Texas on September 12, 2008 prior to the 
landfall. Ike then made a landfall at Galveston, Texas on 
2:10 a.m. Central DayHght Time (CDT), September 13, 2008, 
causing strong winds, flooding, and heavy rains across Texas. 
The hurricane weakened to a tropical storm at 1:00 p.m. 
September 13 and passed Texas by 2:00 a.m. September 14. 

Widespread power outages were reported across Louisiana 
and Texas starting September 12 |28|. A major utility provider 
collected data on power outages from more than ten counties. 



The outages include various component failures in a distribu- 
tion network such as failed circuits, fallen poles, and non- 
operational substations. The raw data set consists of 5152 
samples. Each sample consists of the failure occurrence time 
(ti) and duration (di) of a component (i) in a distribution 
network. The accuracy for time i is a minute. The data set 
contains failures occurred from September 12 through 14, 
2008. 

The 5152 samples on the failure occurrence time are plotted 
in Fig|3] where the length of each bin is two hours. There were 
significantly more failures occurred from 7 a.m. September 
12 to 4 a.m. September 14, during which Hurricane Ike made 
the landfall in Texas. There are 2005 samples in this time 
period. Those 2005 samples are considered as failures due to 
Hurricane Ike. 

Furthermore, the data set contains groups of failures that 
occurred within a minute. As a minute is the smallest time 
scale for each sample, the groups are considered as dependent 
failures. Dependent failures are grouped as one failed entity 
(z), with a unique failure occurrence time ti and duration 
di. After such preprocessing, the resulting data set has 465 
failed entities. Two outliers with negative failure duration are 
further removed. The remaining 463 failed entities from 7 am 
September 12 to 4 am September 14 are referred to as nodes. 
D — {^i, di}i=i is the data set we use for the rest of the paper 



B. Empirical Failure Process 

We now focus on studying the empirical failure process 
using data set. 

1) Learning failure rate: First, we use the data set to 
learn failure rate function Xf{t). The empirical rate function 
is estimated using a simple moving average |30|: A/(t) = 

Nf{t+T)~Nf(t- 



2t 



where r > 0. The resulting rate function is 
overlaid with the samples on the number of failures Nf{t) in 
Figure |4] where each bin is of duration 1 hour, r = 5 hours. 
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Fig. 4. Histogram of failure occurrence times and the failure rate ^f{t) 
during Hurricane lice. 



Fig. 5. Quantile-Quantile plot on occurrence time of power outages. 



The failure rate function shows a time-varying, i.e., non- 
stationary, intensity of new failure occuiTence, and can be 
described as follows. 

• Prior to 7 p.m. September 12, the intensity was low, i.e., 
fewer than 5 new failures occurred per hour. Hence Aq = 
5 where Aq is considered as the failure rate in day-to-day 
operation. 

• At 7 p.m. September 12, the intensity increased sharply 
first to 25 new failures per hour. In the next 6 hours, 
the intensity reached the peak value of nearly 50 new 
occurrences per hour. This is consistent to the weather 
report |29| that the strong wind about 145 mph and 
flooding impacted the onshore areas even prior to the 
landfall. The time when the peak occurred coincides with 
the landfafl at 2:10 a.m 9/13 CDT 

• After staying at the high level for about 12 hours (from 7 
p.m. September 12 to 7 a.m. September 13), the intensity 
decreased rapidly back to a low level of less than 5 new 
failures per hours. 

2) Non-Homogeneous Poisson Model: We now consider a 
hypothesis that these 463 failure occuiTences are governed 
by a non-homogeneous Poisson process (NHPP). We perform 
Pearson's test |31 1 on hypothesis i/o that the empirical failure 
process Nf{t) follows a non-homogeneous Poisson Process 
(NHPP) with intensity function Xf{t). The test is on two 
aspects: (a) independence of the outages occurred at the large 
time scale of minutes, and (b) the sample mean, i.e., the 
empirical rate function Xf{t), is sufficient for characterizing 
the failure process. 

We divide the time duration from 7 a.m. September 12 to 
4 p.m. September 14 into 400 intervals. In each interval, the 
number of new failures is compared with the expectation. The 
sum of the square errors from all intervals results in a chi- 
square statistic. The chi-square statistic is = 0.79, with a 
degree of freedom of 2. Given a confidence level at 95%, a 



Xq 05 2) ~ 0.95. The chi-square statistic obtained from the 
data is below the threshold x^ < Xo 05 2- Hence hypothesis 
Hq is not rejected. The detailed procedure of Pearson's test is 
given in Appendix [B] 

However, not rejecting Hq is insufficient for accepting the 
hypothesis. The goodness of fit of NHPP to the data is further 
validated through Quantile-Quantile (QQ) plot given in Figure 
|5] There, the samples are compared with an non-homogeneous 
Poisson process with intensity function Xf{t). The figure 
shows that the non-homogeneous Poisson process with the 
learned intensity function indeed exhibits a good fit to the data. 
Based on the result from Pearson's hypothesis testing and the 
QQ plot, these 463 power failures occur independently, and 
obey a non-homogeneous Poisson distribution. 

C. Empirical Recovery Process 

We now learn the empirical recovery process characterized 
by g{d\t), the conditional probability density function of 
failure duration given failure occurrence time. Our objective 
here is to identify infant and aging recovery. 

1) Data: The 463 samples on the failure durations in our 
data set correspond to the failures occurred from 7 a.m. 
September 12 to 4 p.m. September 14. These samples result in 
a joint empirical distribution g{d,t) depicted in the 3-D Fig|r| 
Each bin is of length (initial failure time) of 1 hour and width 
(duration) of 4 hours. The height of each bin located at initial 
failure time t and failure duration d, represents the number of 
failures that occur at t and last for d. 

2 ) Mixture Model: Given failure time t, we select a mixture 
model as the probability density function g{d\t) for duration 

d>0. 



gid\t) 



i(t) 

E 



Pj{t)gj{d\t), 



(32) 



threshold value is obtained as Xq ( 



5.99, where Pr{x < 



where l{t) is the number of mixtures at time t, Pj{t) (1 < j < 
I) is a weighting factor for the jth mixture function gj{d\t). 
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TABLE I 

Estimated parameters of the five homogeneous Weibull 
distributions. 
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Fig. 6. Empirical distribution of failure duration corresponding to the failures 
occurred in Vi- 



and J2 Pji't) = 1- All these quantities vary with failure time t 
for a non-stationary recovery process. 

A mixture model is chosen since its parameters exhibit in- 
terpretable physical meaning. A parametric family of Weibull 
mixtures is particularly appealing as the parameters correspond 
to infant and aging recovery directly 1 15|. Weibull distributions 
have been widely used in survival analysis p6) | [25j and 
reliability theory |15| , but not in characterizing recovery from 
large-scale disturbances and resilience parameters. Specifi- 
cally, a Weibull distribution is 

9,(d\fn,(t),k,it)) = M!l(^)^,W-ie-'^''^'", (33) 
7j(t) 7j(<) 

where d > 0, kj{t) and 7j(t) are the shape and scale 
parameters respectively. A shape parameter kj (t) is especially 
important for determining the type of recovery signified by 
a mixture component. The smaller kj{t) is, the faster the 
decay rate of gj{d\t), the shorter the failure duration and 
thus the faster the recovery. Hence, kj{t) < 1 corresponds 
to infant recovery whereas kj{t) > 1 corresponds to aging 
recovery. Weighting factor Pj{t) signifies the importance of 
a component gj{d\t). For a non-stationary recovery process, 
these parameters are varying with failure time t. 

3) Learning Mixture Parameters: The parameters of the 
mixtures of Weibull distribution are learned from the data. 
For simplicity, we use a piecewise homogeneous function 
to approximate g{d\t). We divide the failure time into m 
intervals. Within an interval tpi, g{d\t S ipi) — gi{d) is 
assumed to be a time homogeneous function that does not 
vary with failure time t. For different intervals, g{d\t £ ipiYs 
have different parameters for non-stationarity, where 

h 

g{d\t e tpi) = ^ pijgij{d; jij,kij). (34) 
j=i 

Referring to Fig{T] we divide the time into 5 intervals, 
{ipi,i — 1,2,3,4,5}, the boundaries of each interval are 
depicted by dashed lines in Fig[T] 



Within each interval, the parameters of the mixture Weibull 
distribution are learned through maximum likelihood estima- 
tion 1 32 1, and shown in Table |l] Each mixture represents 
one cluster of failure durations. For example, distribution of 
durations in is depicted in Fig|6] where we observe 3 
clusters. 

For different intervals, the failure duration exhibits distinct 
distribution. For example, i/ji (7 a.m. September 12 to 7 p.m. 
September 12) corresponds to the time before hurricane. In this 
interval, the network was not yet impacted by hurricane Ike 
and was under day-to-day operation. Most failure durations 
were as short as a few hours. ■02 (7 p.m. September 12 to 
3 a.m. September 13) corresponds to the time right before 
the landfall and hurricane began to cause large-scale failures. 
In this interval, more prolonged failures occur and recovery 
became more difficult than day-to-day operation. 

D. Overall Failure-and-Recovery Process 

We now compare the empirical failure-and-recovery process 
{N{t)} with the learned process N{t). N{t) is obtained by 
directly adding up the number of failed nodes from the actual 
data. N{t) is obtained by reconstructing the temporal process 
with learned A/(i) and \r{t) through EqjlT] Fig]?] shows the 
comparisons. The dotted N{t) is obtained with the assumption 
that g{d) is stationary over time. The dashed N{t) is obtained 
with the piecewise stationary g{d\t) given in Table [l] All 
estimated processes are able to capture the trend in the data. 
However, the stationary distribution of failure durations g{d) 
deviates significantly from the actual sample path N{t). This 
shows that the piecewise stationary g{d\t) better approximates 
the actual failure-and-recovery process. 

E. Resilience 

To evaluate resilience, we first calculate the probability of 
infant recovery within each interval ipi as shown in Table |l] 
The resilience curve s{x) is then obtained by averaging these 
probabilities over failure time. Figjsjdepicts s{x) together with 
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Fig. 7. Comparison between tlie joint failure-recovery process Af(t) from 
the data set and the reconstructed process Af(t) using learned parameters. 



Fig. 8. Resilience curve of the power network. 



its second derivative s[x) is a concave function. 

Hence, the threshold is obtained as, 



do = ai-gmin3,{-^s(a;)} « 13. 
we compute the resilience 

s(13) = Et{G{\'i\ty\ = 0.4722. 



(35) 



(36) 



Hence, 47.22% of the recoveries occurred within 13 hours 
whereas 52.78% occurred later than 13 hours. Such resilience 
close to 0.5 reveals the combined nature of infant and aging 
recovery in the distribution network. Specifically, s(13) < 0.5 
indicates a slight dominance of aging recovery over infant 
recovery. 

VII. Discussion and Conclusion 

A non-stationary random process has been derived to 
model large-scale failure and recovery of a power distribution 
network under an external disturbance. The non-stationary 
network failure and recovery are characterized by time-varying 
failure rate and probability distribution of recovery time. These 
two quantities, together with a threshold on recovery time, 
define network resilience as the probability of rapid recovery. 
Analytical expressions have been derived to characterize the 
non-stationary failure and recovery under extreme conditions. 

Real data has been obtained from an operational distribution 
network for a large-scale external disturbance. Hurricane Ike. 
Model parameters of the failure-recovery process have been 
estimated through non-stationary learning using the real data. 
Pearson's statistical test is applied to validate the assumptions 
of the failure model. The resilience metric has been evaluated, 
and shown for the operational network that 47% of 5000+ 
failures recovered within 13 hours whereas the remaining 
prolonged as long as 12 days. This provides a quantitative 
measure of the resilience of the distribution network and a 
baseline for improvement. 



How to improve resilience to reduce failures needs to be 
further incorporated in the resilience metric. Non-stationary 
learning algorithms need to be further developed for on-line 
learning of time-varying data. 
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Appendix A 

Proof of Theorem: In order to prove that the recovery 
process {7Vr(^)} is a Poisson process, we just need to show 
the increment of recovery process, i.e., recoveries occurs in 
(t,t + r), is an independent Poisson random variable. Now 
consider non-overlapping intervals 0\, Ofe. We say a failure 
is type i, i ~ l,...,k, if it recovers in the interval O^. The 
number of the type i events equals to the number of recoveries 
occur in the interval Oi. By Proposition 5.3 in p3] , it follows 
that the number of recoveries occur in the interval Oi, i.e., 
the increment of recovery process, is an independent Poisson 
random variable with mean. 



i?{recoveries in O,} 



Xf{s)Pi{s)ds, 



(37) 



Thus, 



where Pi{s) is the probability that a failure is type ^. 
we prove that {Nr{t)} is a Poisson process. 

Then we compute Xr{t) and show the non-homogeneity. 
Consider interval {t, t + t) and a failure occurs at time s, 
where s < t. The probability that this failure recovers during 
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(t, t + t) is G{t + T - s\s) - G{t - s\s). Eg |37] becomes, 
E{Nr{t + t) - Nr{t)} 

\f{s)[G{t + T-s\s)-G{t-s\s)]ds 

/ A/(s)g(i - s|s)ds + o(t), 
Jo 

where G{t + T- s\s) - G{t - s\s) = g{t - s\s)t + o(t). We 
can rewrite Eq 38 as, 
ft 



/ Xf{s)g{t — s\s)ds=\im 
Jo 



E{Nr{t + t) - Nrit)} 



(39) 



The right-hand-side is just Xr{t) (Eq[T4|). Hence, the recovery 
rate function \r{t) is. 



Kit) 



Xf{s)g{t — s\s)ds. 



(40) 



Due to the non-homogeneity of {Nf{t)}, Xf{t) is a time- 
varying function, so X,.{t) is also time-varying function. 
Hence, we show the non-homogeneity of {Nr{t)}, which 
completes the proof. ■ 

Appendix B 

Pearson's Hypothesis Test: The hypothesis test is based on 
a chi-square statistic which compares the failure occurrence 
times with their sample mean. The details of testing Hq that 
failure occurrence times are drawn from a NHPP {Nf{t)} are 
given here. 

1. Compute the estimated failure rate Xf{t). At time t, count 
the number of failure occurrences in [t—r, t+r), then A/(t) = 

Nf{t+T)-Nf{t-T) 

2t 

2. Divide the failure occurrence times into m intervals. 
Count the number of failure occurrences in each interval. 
Let Ci denote the number of occurrence in interval i, where 
i = 1, 2, m. Let k = maxci, i = 1,2, ...m. 

3. Count Oj, which is the observed number of intervals with 
j failure occurrences, where j = 0,1, k. 

4. Use the estimated Xf{t) to compute Ej, which is the 
expected number of intervals with j failure occurrences. 

Ej = X^i'ii j\ ' where (s^, ti) is the ith time interval, 

5. Compute the sum = Sj=o ^'^^ 'e^' ^ ■ ^ ^ 
square statistic, with degree of freedom do] = fc— (number of 
independent parameter fitted)—!. Since one parameter A/(t) 
is fitted, the degree of freedom is fc — 2. 

6. Given a confidence level, for instance 95%, we obtain 
a threshold value Xo 05 do/- hypothesis is rejected if 
^ ^0 05 do/' Otherwise, Hq cannot be rejected. 
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